Automated languages phylogeny from Levenshtein distance
Abstract
Languages evolve over time through a process in which reproduction, mutation and extinction are all possible. This is very similar to haploid evolution in asexual organisms or in the mtDNA of complex ones. Exploiting this similarity it is possible, in principle, to verify hypotheses concerning the relationships among languages. The key point is the definition of a distance between pairs of languages, in analogy with the genetic distance between pairs of organisms. Assuming that vocabulary is the analogue of DNA, distances can be evaluated from lexical differences. This concept seems to have its roots in the work of the French explorer Dumont D’Urville. He collected comparative word lists of various languages during his voyages aboard the Astrolabe from 1826 to 1829 and, in his work on the geographical division of the Pacific, he proposed a method to measure the degree of relation among languages. The method used by modern glottochronology, developed by Morris Swadesh in the 1950s [20], measures distances from the percentage of shared cognates, which are words with a common historical origin. The weak point of this method is that subjective judgment plays a significant role. Even when cognacy decisions are made by trained and experienced linguists, counting the cognate words in a list is far from trivial, and results may vary between studies. Furthermore, these decisions can require an enormous amount of time. Recently, we proposed a new automated method which has several advantages: first, it avoids subjectivity; second, results can be replicated by other scholars provided that the database is the same; third, no specific linguistic knowledge is required; and last, but surely not least, it allows rapid comparison of a very large number of languages.
The distance between two languages is defined by considering a renormalized Levenshtein (or edit) distance between words with the same meaning and averaging over the words contained in a list of 200 features. The renormalization, which takes word length into account, plays a crucial role, and no sensible results can be found without it. Assuming a constant rate of mutation, these lexical distances are, on average, logarithmically proportional to the genealogical ones. Phylogenetic trees can then be reconstructed from the matrix that contains the genealogical distances between all pairs of languages in a family. We applied our method to the Indo-European and Austronesian groups, considering fifty different languages in each case. From the two matrices, we obtained two genealogical trees using the Unweighted Pair Group Method Average (UPGMA) [19]. The trees are similar to those found by previous research [6, 7], with some important differences concerning the position of a few languages and subgroups. Indeed, we think that these differences carry some fresh information about the structure of the tree and about the phylogenetic relations inside the families.
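The lexical distance described above can be illustrated with a minimal sketch. It assumes that the renormalization divides the edit distance by the length of the longer of the two words, which the abstract does not state explicitly; the function names and the dictionary layout (meaning mapped to word form) are hypothetical and chosen only for illustration.

```python
from typing import Dict


def levenshtein(a: str, b: str) -> int:
    """Plain edit distance: insertions, deletions and substitutions each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter word in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer word's length (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return levenshtein(a, b) / longest


def language_distance(words_a: Dict[str, str], words_b: Dict[str, str]) -> float:
    """Average normalized distance over the meanings present in both word lists."""
    shared = [meaning for meaning in words_a if meaning in words_b]
    if not shared:
        raise ValueError("the two word lists share no meanings")
    total = sum(normalized_distance(words_a[m], words_b[m]) for m in shared)
    return total / len(shared)
```

With this normalization each word pair contributes a value between 0 and 1, so long and short words carry equal weight in the average. Computing `language_distance` for every pair of languages in a family would give the kind of distance matrix from which a tree can be built.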
Related resources
Automated Word Stability and Language Phylogeny
The idea of measuring distance between languages seems to have its roots in the work of the French explorer Dumont D’Urville (1832). He collected comparative word lists of various languages during his voyages aboard the Astrolabe from 1826 to 1829 and, in his work about the geographical division of the Pacific, he proposed a method to measure the degree of relationship among languages. The metho...
Phylogeny and geometry of languages from normalized Levenshtein distance
The idea that the distance among pairs of languages can be evaluated from lexical differences seems to have its roots in the work of the French explorer Dumont D’Urville. He collected comparative words lists of various languages during his voyages aboard the Astrolabe from 1826 to 1829 and, in his work about the geographical division of the Pacific, he proposed a method to measure the degree of...
Levenshtein Distances Fail to Identify Language Relationships Accurately
The Levenshtein distance is a simple distance metric derived from the number of edit operations needed to transform one string into another. This metric has received recent attention as a means of automatically classifying languages into genealogical subgroups. In this article I test the performance of the Levenshtein distance for classifying languages by subsampling three language subsets from...
Automated words stability and languages phylogeny
The idea of measuring distance between languages seems to have its roots in the work of the French explorer Dumont D'Urville (D'Urville 1832). He collected comparative word lists of various languages during his voyages aboard the Astrolabe from 1826 to 1829 and, in his work about the geographical division of the Pacific, he proposed a method to measure the degree of relation among languages. Th...
Graphonological Levenshtein Edit Distance: Application for Automated Cognate Identification
This paper presents a methodology for calculating a modified Levenshtein edit distance between character strings, and applies it to the task of automated cognate identification from nonparallel (comparable) corpora. This task is an important stage in developing MT systems and bilingual dictionaries beyond the coverage of traditionally used aligned parallel corpora, which can be used for finding...
Journal title:
- CoRR
Volume: abs/0911.3280
Publication date: 2009